[My Github link] https://github.com/sydneydlu98/DSI_Data_Challenge_5

Loading/Cleaning Data and Exploratory Analysis

In this Data Challenge, we will be clustering foods from the nndb_flat dataset provided on Canvas. To load/clean the data as well as perform some exploratory analysis:

  1. Read in the data
## load all the packages
library(readr)
library(dplyr)
library(GGally)
library(tidyverse)
library(plotly)
library(ggplot2)

## read in data
data <- read_csv("nndb_flat.csv")
  1. We will be dealing with only data that falls under the food groups of Vegetables and Vegetable Products, Beef Products, and Sweets. Filter the data to contain only these food groups.
## filter the data to only contain food groups of Vegetables and Vegetable Products, Beef Products, and Sweets
object <-
  c("Sweets", "Beef Products", "Vegetables and Vegetable Products")

## subset the data
clean_data <- data %>%
  subset(FoodGroup %in% object)
  1. Select only the variables from Energy_kcal to Zinc_mg
## select variables from Energy_kcal to Zinc_mg
var <- clean_data %>%
  select(Energy_kcal:Zinc_mg)
  1. Examine the correlation among the variables using GGally::ggcorr. Which variables have a high correlation?
## examine the correlation
GGally::ggcorr(
  var,
  size = 3.2,
  label = TRUE,
  label_size = 2.7,
  hjust = .9,
  layout.exp = 2
) 

If the coefficient value lies between ± 0.50 and ± 1, then it is said to be a strong correlation. If the value is near ± 1, then it said to be a perfect correlation: as one variable increases, the other variable tends to also increase (if positive) or decrease (if negative).

We can see from the correlation plot, we do not have any high negative correlation, but we do have many high positive correlations. Such as the correlation coefficient between Folate_mcg and Thiamin_mg is 1, which is perfect positive correlation; and others like Protein_g and Zinc_mg have a correlation coefficient of 0.9 which is considered high correlation. As well as the correlation coefficient between carb_g and sugar_g which is 0.8, this is also considered high correlation.

Performing PCA

Steps for performing the PCA on the data:

  1. Perform PCA on the data. Don’t forget to scale the data (if it is appropriate for this application)!
## scale the data 
data_scaled <- scale(var, 
                     center = TRUE,
                     scale = TRUE)
## perform PCA
pca_data <- prcomp(data_scaled,
                   center = FALSE,
                   scale. = FALSE)
  1. Make a plot showing the cumulative proportion of the variation explained by each PC with cumulative variation explained on the y-axis and PC on the x-axis.
## extract the proportion of the variation explained by each PC
var_explainded <- summary(pca_data)$importance[2,]

## calculate the cumulative proportion of the variation explained by each PC
cumulative <- cumsum(var_explainded)

## create the table for cumulative proportion of the variation explained by each PC
var_explained_df <- data.frame(PC = 1:23,
                               var_explainded = var_explainded,
                               cum_var_explained = cumulative
)

## plot the graph for cumulative proportion of the variation explained by each PC
var_explained_df %>%
  ggplot(aes(x = PC, 
             y = cum_var_explained, 
             group = 1)) +
  geom_point() +
  geom_line(lwd = 1) +
  labs(x = 'Number of PCs',
       y = 'Cumulative Variation Explained') +
  ggtitle("Plot of cumulative proportion of the variation explained by each PC") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16)) +
  # add line for better visualization
  geom_vline(
    xintercept = 3,
    linetype = 2,
    lwd = 1,
    col = "red"
  ) 

  1. We will look at the first 3 PCs which explain about 60% of the variation in the data. Note that you may want to look at more depending on what your application is. Make 3 separate plots for the loadings for the first 3 PCs for all of the variables, ordered by the absolute value of the magnitude of the loadings.
## create the data frame for the loadings of PC1, PC2, PC3
pca_loadings <- as.data.frame(pca_data$rotation) %>%
  dplyr::select(PC1, 
                PC2, 
                PC3) %>%
  mutate(variable = rownames(pca_data$rotation))

## Plot for the loadings of PC1 for all of the variables
pc1 <- ggplot(pca_loadings,
              aes(x = reorder(variable,
                              abs(PC1)),
                  y = PC1)) +
  geom_bar(stat = 'identity',
           fill = "#FF6666") +
  theme(axis.text.x = element_text(
    angle = 50,
    hjust = 1,
    size = 13
  )) +
  ggtitle("Plot for the loadings of PC1 for all of the variables") +
  theme(plot.title = element_text(hjust = 0.5,
                                  size = 16)) +
  labs(x = "Variables",
       y = "Loadings")

pc1

## Plot for the loadings of PC2 for all of the variables
pc2 <- ggplot(pca_loadings,
              aes(x = reorder(variable,
                              abs(PC2)),
                  y = PC2)) +
  geom_bar(stat = 'identity',
           fill = "darkgreen") +
  theme(axis.text.x = element_text(
    angle = 50,
    hjust = 1,
    size = 13
  )) +
  ggtitle("Plot for the loadings of PC2 for all of the variables") +
  theme(plot.title = element_text(hjust = 0.5,
                                  size = 16)) +
  labs(x = "Variables",
       y = "Loadings")

pc2

## Plot for the loadings of PC3 for all of the variables
pc3 <- ggplot(pca_loadings,
              aes(x = reorder(variable,
                              abs(PC3)),
                  y = PC3)) +
  geom_bar(stat = 'identity',
           fill = "blue") +
  theme(axis.text.x = element_text(
    angle = 50,
    hjust = 1,
    size = 13
  )) +
  ggtitle("Plot for the loadings of PC3 for all of the variables") +
  theme(plot.title = element_text(hjust = 0.5,
                                  size = 16)) +
  labs(x = "Variables",
       y = "Loadings")

pc3

  1. Make 3 plots of the scores on the PCs colored by food group. Plot the below scores. Make the plots interactive with plotly so you can identify the food description of any outliers.
## make the data frame for the scores on the PCs
pca_scores <- as.data.frame(pca_data$x)

## add 2 columns to make the plots
pca_scores <- pca_scores %>%
  mutate(FoodGroup = clean_data$FoodGroup) %>%
  mutate(description = clean_data$ShortDescrip)
    1. PC1 versus PC2
## make the plot of scores on the PCs (PC1 versus PC2)
plot1 <- ggplot(pca_scores,
                aes(x = PC1,
                    y = PC2,
                    col = FoodGroup,
                    label = description
                )) +
  geom_point() +
  ggtitle("PC1 versus PC2") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16))

ggplotly(plot1)
    1. PC1 versus PC3
## make the plot of scores on the PCs (PC1 versus PC3)
plot2 <- ggplot(pca_scores,
                aes(x = PC1,
                    y = PC3,
                    col = FoodGroup,
                    label = description
                )) +
  geom_point() +
  ggtitle("PC1 versus PC3") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16))

ggplotly(plot2)
    1. PC2 versus PC3
## make the plot of scores on the PCs (PC2 versus PC3)
plot3 <- ggplot(pca_scores,
                aes(x = PC2,
                    y = PC3,
                    col = FoodGroup,
                    label = description
                )) +
  geom_point() +
  ggtitle("PC2 versus PC3") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16))

ggplotly(plot3)

Identify Outlier and Performing PCA Again

  1. There is a major outlier on the plots above – which food is the outlier? Remove the outlier from your data.

The major outlier on the plots above is yeast extract spread from the food group vegetable and vegatable products.

Then we remove this outlier.

## remove the outlier I identified above
complete_data <- clean_data %>%
  filter(ShortDescrip != "YEAST EXTRACT SPREAD")
  1. Perform PCA again on the dataset without the outlier (steps 1-4 in the Performing PCA section above) and look at the loadings of the first 3 PCs. Have these changed? Investigate and comment on what could have caused any changes.
## re-run steps 1-4 in the Performing PCA section above
var_new <- complete_data %>%
  select(Energy_kcal:Zinc_mg)

## scale and center the filtered data without outlier
data_scaled_new <- scale(var_new, 
                         center = TRUE,
                         scale = TRUE)

pca_data_new <- prcomp(data_scaled_new,
                       center = FALSE,
                       scale. = FALSE)

var_explainded_new <- summary(pca_data_new)$importance[2,]
cumulative_new <- cumsum(var_explainded_new)
var_explained_df_new <- data.frame(PC = 1:23,
                                   var_explainded = var_explainded_new, 
                                   cum_var_explained = cumulative_new)

var_explained_df_new %>%
  ggplot(aes(x = PC, 
             y = cum_var_explained, 
             group = 1)) +
  geom_point() +
  geom_line(lwd = 1) +
  labs(x = 'Number of PCs',
       y = 'Cumulative Variation Explained') +
  ggtitle("Plot of cumulative proportion of the variation explained by each PC \n (without outlier)") +
  theme(plot.title = element_text(hjust = 0.5, size = 16)) +
  # add line for better visualization
  geom_vline(
    xintercept = 3,
    linetype = 2,
    lwd = 1,
    col = "red"
  )

pca_loadings_new <- as.data.frame(pca_data_new$rotation) %>%
  dplyr::select(PC1, PC2, PC3) %>%
  mutate(variable = rownames(pca_data_new$rotation))

pc1_new <-
  ggplot(pca_loadings_new, 
         aes(x = reorder(variable,
                         abs(PC1)),
             y = PC1)) +
  geom_bar(stat = 'identity', 
           fill = "#FF6666") +
  theme(axis.text.x = element_text(
    angle = 50,
    hjust = 1,
    size = 13
  )) +
  ggtitle("Plot for the loadings of PC1 for all of the variables \n (without outlier)") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16)) +
  labs(y = "Loadings", x = "Variables")

pc1_new

pc2_new <-
  ggplot(pca_loadings_new, 
         aes(x = reorder(variable,
                         abs(PC2)),
             y = PC2)) +
  geom_bar(stat = 'identity', 
           fill = "darkgreen") +
  theme(axis.text.x = element_text(
    angle = 50,
    hjust = 1,
    size = 13
  )) +
  ggtitle("Plot for the loadings of PC2 for all of the variables \n (without outlier)") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16)) +
  labs(x = "Variables",
       y = "Loadings")

pc2_new

pc3_new <-
  ggplot(pca_loadings_new, 
         aes(x = reorder(variable, 
                         abs(PC3)),
             y = PC3)) +
  geom_bar(stat = 'identity', 
           fill = "blue") +
  theme(axis.text.x = element_text(
    angle = 50,
    hjust = 1,
    size = 13
  )) +
  ggtitle("Plot for the loadings of PC3 for all of the variables \n (without outlier)") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16)) +
  labs(x = "Variables",
       y = "Loadings")

pc3_new

pca_scores_new <- as.data.frame(pca_data_new$x)

pca_scores_new <- pca_scores_new %>%
  mutate(FoodGroup = complete_data$FoodGroup) %>%
  mutate(description = complete_data$ShortDescrip)

plot1_new <- ggplot(pca_scores_new,
                    aes(
                      x = PC1,
                      y = PC2,
                      col = FoodGroup,
                      label = description
                    )) +
  geom_point() +
  ggtitle("PC1 versus PC2 (without outlier)") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16))

ggplotly(plot1_new)
plot2_new <- ggplot(pca_scores_new,
                    aes(
                      x = PC1,
                      y = PC3,
                      col = FoodGroup,
                      label = description
                    )) +
  geom_point() +
  ggtitle("PC1 versus PC3 (without outlier)") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16))

ggplotly(plot2_new)
plot3_new <- ggplot(pca_scores_new,
                    aes(
                      x = PC2,
                      y = PC3,
                      col = FoodGroup,
                      label = description
                    )) +
  geom_point() +
  ggtitle("PC2 versus PC3 (without outlier)") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16))

ggplotly(plot3_new)

By looking at the loadings of the first 3 PCs, they do notice there are things changed. We are able to see the loading of folate_mcg changed from positive loading to almost no loading, it is because the outlier: yeast extract spread has a insanely high value of folate_mcg compares to others, so when we get rid of the outlier, we would expect the loading of folates_mcg in PC1 to go down; The loadings in PC2 have small changes but nothing significant; Then in the loadings of PC3, we are able to observe many big changes, such as vitA_mcg, vitB12-mcg and Manganese_mg change from large negative loadings to large positive loadings, as well as folate_mcg changed from positive loading to almost no loading and Niacin_mg changes from large positive loading to almost no loading. It makes sense because outlier has extremely large value for Folate_mcg and Niacin_mg, so get rid of the outlier should significantly decrease the loadings in PCs.

  1. Describe what you see in the plots of the scores and interpret this in conjunction with the loadings that you observed for the PCs.

There is no outlier anymore in our new plots of scores and we are able to see in 3 biplots, they all have some overlapping between 3 food groups and the distance among these 3 food groups are near, as the nearer the distance are these 3 groups from each other, meaning there are more similarities among them in terms of nutritional elements (Points that are close to each other in the biplot represent observations with similar values).

The orientation (direction) of the vector, with respect to the principal component space, in particular, its angle with the principal component axes: the more parallel to a principal component axis is a vector, the more it contributes only to that PC. We are able to see vegetable and vegetable products are parallel to PC2 which means the more it contributes only to PC2.

Also highly correlated variables point in similar directions; uncorrelated variables are nearly perpendicular to each other. So we are able to see, in the first biplot, many beef products are highly correlated with vegetable and vegetable products in terms of food nutritional elements.

The length in the space which indicates the longer the vector, the more variability of this variable is represented by the two displayed principal components; short vectors are thus better represented in other dimension. So, we are able to see beef products are well represented by each of 3 biplots with PC1, PC2 and PC3 as many beef products have the longest length in each of the plot, and vegetable and vegetable products are well represented by PC1 and PC2 in the first biplot.

Because a PCA biplot combines both the principal component scores and the loading vectors in a single biplot display. We are able to interpret loadings in our PCA biplot. We can see most of the loadings in PC1 and PC2 are positive, so we would expect the majority of points to be situated in the first quadrant (both x-axis and y-axis are positive) of the first biplot (PC1 VS PC2); However, many loadings in PC3 are negative, so we would expect most of points in biplots of PC1 VS PC3 and PC2 and PC3 to be mostly situated in the fourth quadrant (where x-axis is positive and y-axis is negative).